CS4132 Data Analytics

Crosswords: Accessibility and Representation

by Chua Wee Chong

Important Note: Please keep your report concise and relevant (i.e. show only relevant steps and visualizations used to answer your research questions).

Table of Content (with relevant hyperlinks to sections)

Motivation and Background

Give an overview of the project, motivation, background and goals.

American crosswords are word puzzles in which the goal is to fill in all the white squares with letters that fit the given clues. The special rules of American crosswords are that every white square must belong to two answers (one across and one down) and that each answer must be at least 3 letters long. The answers are a mix of trivia and common phrases. There are two main criticisms of American crosswords.

The first criticism concerns accessibility. Sometimes the constructor cannot find a better configuration and must use obscure terms to fill the grid. These obscure terms are called "crosswordese". They appear often in puzzles because they have convenient letter patterns. However, as computers have evolved to assist humans in construction, the quality of puzzles has been improving.

The second criticism concerns representation. Crosswords are said to reflect part of the constructor's personality. Originally, crosswords were aimed at straight, liberally educated white men. As time passed, people realised that other groups were under-represented in both answers and clue-writing. Hence, there has been a push to include more of these people in crosswords, including mentorship for women, people of colour and LGBTQ people who want to construct crosswords.

This project aims to find out how accessible crosswords currently are, given the rise of computers as a construction aid. In a similar fashion, it also aims to find out how well crosswords represent minorities.

Summary of Research Questions & Results

Repeat your research questions in a numbered list. After each research question, clearly state the answer/conclusion you determined. Do not give details or justifications yet — just the answer

Accessibility:

1. "Crosswordese"

How has the number of obscure answers changed over the years? Crosswordese is the use of an obscure word with a convenient letter pattern, often one with many common letters or vowels, to fill in the grid. These words make puzzles hard for people who are not part of an "in-group" that knows these common crossword words. Hence, I would like to find out how much "crosswordese" crosswords have contained over the years.

  • Crosswordese has decreased only slightly over the years; the change is not large.
  • Crosswordese increases slightly through the week, making later puzzles less accessible to newcomers.

2. Freshness

How has the "freshness" of crosswords changed over the years? Crosswords reflect the world and its current trends. As the pool of published crosswords grows, the number of never-before-seen terms and names has steadily decreased. However, words and phrases are coined every day, and some catch on in modern language. This question investigates that tension, together with how computers have given constructors more freedom in filling the grid.

  • Freshness has increased over the years, giving better puzzles.
  • Freshness increases through the week, as puzzles become looser in theme.

    Representation:

    3. Inclusive Clues

How has clue-writing changed over the years? Clue-writing is half of the puzzle, and it can introduce unwanted stereotypes. For example, clues for the answer MIT have been associated with men more than women, reflecting a biased assumption that men are more prominent in tech. Hence, by counting the mentions of minorities in clues, one can gauge how progressive the puzzle has become.

  • Female names are used less than male names in clues and answers combined.
  • However, the use of names has also decreased through the years.
  • Some outlets strive to make female names used as much as male names, and a shift to equality has been seen in recent years.

    4. Constructors

How has the make-up of constructors changed over the years, and how has it affected the quality of crosswords? As noted, a crossword reflects its constructor's experiences and views, so a more diverse pool of constructors brings more variety. There was a time when constructors were mostly men, which skewed the quality for some solvers. With the rising number of mentorships offered to minorities by prolific constructors, however, there has been an uptick in minority constructors. This question analyses the trend as a whole and, where possible, the impact of mentorship.

  • Women and new constructors are seen to mostly construct only early-week puzzles.
  • Female representation in construction has positive impacts on inclusivity.
  • Collaborations boost the quality of crosswords for all.


Datasets

Numbered list of dataset (with downloadable links) and a brief but clear description of each dataset used. Draw reference to the numbering when describing methodology (data cleaning and analysis).
  1. https://www.crosswordgiant.com/browse (website to scrape for the clue answer pairs)
  2. https://www.xwordinfo.com/ (contains more data about the NYT Crossword in particular)
  3. https://books.google.com/ngrams/ (for word searching)
  4. https://peterbroda.me/crosswords/wordlist/lists/peter-broda-wordlist__scored.txt (crossword construction wordlist by Peter Broda)
  5. https://drive.google.com/uc?export=download&id=1Ruxn8XzRNstU6sDPOMm_K72fVookrPPr (crossword construction wordlist by Brooke Husic and Enrique Henestroza Anguiano)
  6. https://www.verywellfamily.com/top-1000-baby-boy-names-2757618 (boy name list)
  7. https://www.verywellfamily.com/top-1000-baby-girl-names-2757832 (girl name list)

Methodology

You should demonstrate the data science life cycle here (from data acquisition to cleaning to EDA and analysis etc).

Data Acquisition

Display the data which will be used in the project.

The data should be saved in .xlsx or .csv format to be submitted with the project. If webscraping has been done to obtain your data, save your webscraping code in another jupyter notebook as appendix to be submitted separately from the report.

Import and display each dataset in a dataframe.

For each dataset, give a brief overview of the data it contains, and explain the meaning of columns that are relevant to the project.

Many of these datasets were scraped from the internet. The scraping code can be found in the Appendix.

CrosswordGiant.com

This data was scraped from CrosswordGiant.com by visiting every page, then filtering and sorting by the relevant publication outlets. There are 5 DataFrames, one per outlet; they are structured identically and differ only in the Outlet column. Each DataFrame has 4 columns.

Clues - The hint given for the answer in this crossword.

Answers - The expected answer to the given clue.

Outlet - Where was the crossword published?

Date - When was the crossword published?

In [12]:
Out[12]:
   Clues                      Answers  Outlet          Date
0  'High priority!'           RUSH     New York Times  Jan 27 1997
1  'We're number ___!'        ONE      New York Times  Jan 27 1997
2  '___ Blue?' (1929 #1 hit)  AMI      New York Times  Jan 27 1997
3  A.F.L.'s partner           CIO      New York Times  Jan 27 1997
4  Adjusts to fit             ADAPTS   New York Times  Jan 27 1997

XWordInfo.com

This dataset is obtained from XWordInfo, a site with extensive information on New York Times crosswords. The dataset is named NYTCI, standing for New York Times Constructor Info, and has 8 columns. The first two are Day (the day of the week) and Date. Some crosswords are collaborations between up to 3 people, hence C1, C2 and C3, which stand for Constructor 1, 2 and 3 respectively. Where a crossword has fewer than 3 constructors, the unused columns contain a dash. The "C1 No.", "C2 No." and "C3 No." columns give the position of this puzzle in that constructor's sequence of published puzzles. The "C1 Gender", "C2 Gender" and "C3 Gender" columns give the constructors' genders.

In [13]:
Out[13]:
   Day        Date             C1 No.            C1 Gender  C2 No.  C2 Gender  C3 No.  C3 Gender
0  Saturday   January 1, 1994  puzzle # 12       Mr         -       -          -       -
1  Sunday     January 2, 1994  puzzle # 5        Mr         -       -          -       -
2  Monday     January 3, 1994  puzzle # 155      Mr         -       -          -       -
3  Tuesday    January 4, 1994  the debut puzzle  Mr         -       -          -       -
4  Wednesday  January 5, 1994  puzzle # 13       Mr         -       -          -       -

Google NGram

We take some of the most common answers across all crosswords, around 15K, and pass each one to Google NGram in a URL, which returns scrapable data. We then take the years relevant to our crosswords, 1990-2020, and insert them into a DataFrame. However, data is sometimes missing; for example, I found no data for "ISNT". Since this is uncommon and shows no visible pattern, it is reasonable to treat it as random and ignore it. There are 3 different files, as some answers were scraped in different sessions. The resulting DataFrame is displayed.

In [14]:
Out[14]:
ERA AREA ORE ALOE ERIE ONE ERE ARIA ALE ATE ... itunes angler fracas exs strands lender antihero suedes esker petrol
Year
1990 0.000003 0.000007 4.092760e-07 3.317504e-08 2.779331e-07 0.000009 1.515894e-07 6.222778e-08 4.411864e-07 4.286382e-07 ... 3.051807e-11 6.585536e-07 1.608488e-07 1.244890e-07 0.000004 0.000008 5.290083e-08 7.561240e-09 8.216753e-08 0.000002
1991 0.000002 0.000007 4.053728e-07 3.388514e-08 2.709312e-07 0.000009 1.486782e-07 6.383107e-08 4.381159e-07 4.241332e-07 ... 3.259970e-11 6.519289e-07 1.605182e-07 1.120769e-07 0.000004 0.000008 5.585587e-08 7.518122e-09 7.954383e-08 0.000002
1992 0.000002 0.000007 3.924445e-07 3.322866e-08 2.769078e-07 0.000009 1.500604e-07 6.771065e-08 4.270695e-07 4.216174e-07 ... 3.421498e-11 6.445361e-07 1.613821e-07 1.089869e-07 0.000004 0.000008 5.705645e-08 7.448963e-09 7.593409e-08 0.000002
1993 0.000002 0.000007 3.799626e-07 3.290970e-08 2.682965e-07 0.000009 1.496317e-07 6.641682e-08 4.092451e-07 4.072851e-07 ... 3.041332e-11 6.360911e-07 1.625221e-07 1.054417e-07 0.000004 0.000008 5.713023e-08 7.465821e-09 7.266345e-08 0.000002
1994 0.000002 0.000007 3.657307e-07 3.326257e-08 2.661885e-07 0.000008 1.495085e-07 6.875681e-08 4.011461e-07 4.004407e-07 ... 3.178341e-11 6.346831e-07 1.645288e-07 9.975008e-08 0.000004 0.000008 5.877106e-08 7.244199e-09 7.050995e-08 0.000002

5 rows × 25153 columns

Crossword Wordlists

Crossword constructors use wordlists. These wordlists are fed into a program that suggests the best configuration for a particular section, or even for the whole grid. As such, the wordlists are as comprehensive as possible, to maximise the number of configurations from which a human can pick the best. The author of a wordlist usually also scores each answer according to how good, in their opinion, it is. Using these wordlists, we can therefore check how "good" each crossword is, by giving a quantifiable weight to each answer.

Of course, each wordlist is biased by whoever made it. Hence, we use 2 independent wordlists to cross-check. There are many wordlists available; these two were chosen because they are very comprehensive and also free.
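
As a minimal sketch of how a scored wordlist might be loaded for lookups, the snippet below assumes a plain-text file with one `WORD;SCORE` entry per line. That separator is an assumption about the file format, and `load_wordlist` and the sample text are hypothetical names used only for illustration.

```python
import io
import pandas as pd

def load_wordlist(handle):
    """Read 'WORD;SCORE' lines into a Series of scores indexed by upper-case answer."""
    df = pd.read_csv(
        handle, sep=";", header=None, names=["Answer", "Score"],
        dtype={"Answer": str},
        keep_default_na=False,  # stop answers such as "NA" or "NULL" becoming NaN
    )
    df["Answer"] = df["Answer"].str.upper()
    # If a word is listed twice, keep the first score.
    df = df.drop_duplicates("Answer")
    return df.set_index("Answer")["Score"]

# Sample text standing in for a downloaded wordlist file.
sample = io.StringIO("sty;80\nsign;85\ntrio;50\n")
scores = load_wordlist(sample)
print(scores.get("SIGN", 0))  # score for a listed answer
print(scores.get("QQQ", 0))   # unlisted answers fall back to 0
```

Indexing by answer makes the later per-answer scoring a simple lookup with a default of 0 for unlisted words.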

In [15]:
Out[15]:
Score
Answer
STY 80
SIGN 85
TRIO 50
YARD 85
DRAYS 50
In [16]:
Out[16]:
Score
Answer
AAA 50
AAAA 40
AAAAAAAAAAAAAAA 30
AAAAAH 20
AAAADDRESS 20

Girl's/Boy's Name Database

This dataset was acquired by copying and pasting from the websites. It will be used in Q3 to look for occurrences of these names. It serves only as a lookup, not as a DataFrame.


Data Cleaning

For data cleaning, be clear in which dataset (or variables) are used, what has been done for missing data, how was merging performed, explanation of data transformation (if any). If data is calculated or summarized from the raw dataset, explain the rationale and steps clearly.

Since most of the data was scraped, I was able to control its cleanliness, so the quality of the data is high. There were some hitches during data collection, but missing data is rare, if present at all.

For two of the datasets, CrosswordGiant and NGram, a request could return no data. This was handled by catching the exception raised when processing the empty data, which ensures that no invalid data enters the saved file. Again, the code is found in the Appendix.

For the namelists, no cleaning is required, as the publisher has already done it. For checking symmetry, the dataset is very simple and was acquired by scraping. Some of its entries are incorrect because of a logic error that I did not have the skills to fix, but the errors occur at random and should not affect the results significantly. No cleaning is required for it either.

CrosswordGiant

For this dataset in particular, some webpages have garbage data, with the answers just being XXXXXXXX or being duplicated many times. The former is harder to detect, and can be cleaned later, when answering question 1. The latter can easily be removed by checking how many entries the crossword of the particular day has, then just removing them.

Sometimes, publications fill their crosswords with puns, which are bogus words once separated from their theme. These gimmicks are hard to detect, and unfortunately CrosswordGiant does not flag them. Detecting them is a linguistic problem that is outside the scope of this project. Bogus words follow a theme, and themed crosswords appear only on certain days of the week. By and large, however, it is reasonable to assume that bogus words appear at random and are independent between crosswords.

Hence, these will be the steps for cleaning this dataset.

  1. Given the date, find the day of the week.
  2. Toss out known puzzles with bogus words.
  3. Find the puzzles with too many clues and discard them. Then, we just merge them.
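
Steps 1 and 3 above can be sketched roughly as follows. The toy table and the `MAX_ENTRIES` threshold are illustrative assumptions, and step 2 is left out because it relies on a hand-maintained list of known puzzles.

```python
import pandas as pd

# Toy clue table in the same shape as the scraped data (values are illustrative).
clues = pd.DataFrame({
    "Answers": ["RUSH", "ONE", "AMI", "XXXX", "XXXX", "XXXX", "XXXX"],
    "Outlet": ["New York Times"] * 7,
    "Date": ["Jan 27 1997"] * 3 + ["Jan 28 1997"] * 4,
})

# Step 1: parse the date and derive the day of the week (Monday = 0).
clues["Date"] = pd.to_datetime(clues["Date"], format="%b %d %Y")
clues["Day"] = clues["Date"].dt.dayofweek

# Step 3: count entries per puzzle and drop puzzles above a threshold.
MAX_ENTRIES = 3  # hypothetical cut-off; real puzzles would need a realistic limit
counts = clues.groupby(["Outlet", "Date"])["Answers"].transform("size")
cleaned = clues[counts <= MAX_ENTRIES].copy()
print(cleaned)
```

Using `transform("size")` keeps the per-puzzle count aligned with each clue row, so the filter is a single boolean mask.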

While doing the project, I found that this dataset is missing some New York Times crosswords from around 2000. The missing data is not crucial to the project and can be ignored.

In [21]:
Out[21]:
Outlet Date Answers
5654 New York Times 1997-03-09 170
5738 New York Times 1997-06-01 162
5808 New York Times 1997-08-10 170
5948 New York Times 1997-12-28 171
5969 New York Times 1998-01-18 164

XWordInfo

Curiously, their formatting is sometimes irregular. The numbering system calls the first puzzle "the debut puzzle" and the others "puzzle # n", which is easy to convert to a number. They also refer to people as Mr or Ms, depending on their gender, which was easy to replace. The harder part was the inconsistencies in the data: for some people, their name was used instead of Mr X or Ms X. Since there are very few such cases, I cleaned them by hand.

One person's gender is recorded as "A", an artefact of how the scraping was done. I looked up the person's name on the website and found that it is a male name.
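
A minimal sketch of the two easy conversions described above is shown here. The helper `puzzle_number` and the `M`/`F` encoding are assumptions made for illustration; the hand-cleaned exceptions are not covered.

```python
import re
import pandas as pd

def puzzle_number(text):
    """Map 'the debut puzzle' to 1 and 'puzzle # n' to n; return None otherwise."""
    if "debut" in text:
        return 1
    match = re.search(r"#\s*(\d+)", text)
    return int(match.group(1)) if match else None

# Toy rows in the same format as the scraped table.
info = pd.DataFrame({
    "C1 No.": ["puzzle # 12", "the debut puzzle", "puzzle # 155"],
    "C1 Gender": ["Mr", "Mr", "Ms"],
})
info["C1 No."] = info["C1 No."].map(puzzle_number)
# Unrecognised titles become NaN, which flags rows that need cleaning by hand.
info["C1 Gender"] = info["C1 Gender"].map({"Mr": "M", "Ms": "F"})
print(info)
```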


Google NGram

Interestingly, Google NGram returns different results depending on the capitalisation of the word. Hence, I queried both the all-caps and the all-lowercase form of each word. This yields a DataFrame in which each word appears twice. This is easy to clean: we just add the two columns together. The function used also returns the columns in sorted order.

However, DataFrames make poor lookup tables, so I have converted them into dictionaries.
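
One way to combine the two capitalisations and build the dictionary is sketched below, on toy frequencies. This is an illustration under assumed data, not necessarily the function used in the notebook.

```python
import pandas as pd

# Toy frequencies: words were queried in lower case and in upper case.
ngram = pd.DataFrame(
    {"zoo": [1.0e-6, 1.1e-6], "ZOO": [2.0e-6, 2.1e-6], "aaa": [3.0e-6, 3.0e-6]},
    index=pd.Index([1990, 1991], name="Year"),
)

# Normalise the column names, then sum columns that now share a name.
ngram.columns = ngram.columns.str.upper()
combined = ngram.T.groupby(level=0).sum().T  # groupby also sorts the words

# Nested dictionary: lookup[word][year] -> frequency.
lookup = combined.to_dict()
print(lookup["ZOO"][1990])
```

Transposing before grouping avoids grouping along columns directly, which newer pandas versions deprecate.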

In [26]:
Out[26]:
AAA AAAS AAH AAHED AAHS AANDE AAR AARE AARGH AARON ... ZONES ZONK ZOO ZOOM ZOOS ZOOT ZORRO ZOWIE ZSA ZULU
Year
1990 0.000003 4.776738e-07 8.037317e-08 1.080322e-08 2.915548e-08 3.935635e-10 7.202219e-07 1.310561e-08 1.340437e-09 2.204129e-07 ... 0.000022 3.301622e-09 0.000002 0.000001 6.868482e-07 5.572532e-08 1.697215e-08 2.356272e-09 2.278728e-09 3.799622e-08
1991 0.000003 4.676968e-07 8.048573e-08 1.121164e-08 3.127718e-08 4.146365e-10 7.203304e-07 1.271489e-08 1.632048e-09 2.189614e-07 ... 0.000022 3.280857e-09 0.000002 0.000001 6.796661e-07 5.810294e-08 1.706612e-08 2.296641e-09 2.178643e-09 3.731547e-08
1992 0.000003 4.477974e-07 8.223795e-08 1.160830e-08 3.238063e-08 3.912581e-10 7.150338e-07 1.272449e-08 1.558918e-09 2.113534e-07 ... 0.000022 3.240615e-09 0.000002 0.000001 6.747638e-07 8.753073e-08 1.745364e-08 2.390806e-09 2.133923e-09 3.728858e-08
1993 0.000003 4.544412e-07 8.328433e-08 1.200256e-08 3.395920e-08 4.073982e-10 7.179115e-07 1.231911e-08 1.673835e-09 2.096335e-07 ... 0.000021 3.327646e-09 0.000002 0.000001 6.718674e-07 8.726372e-08 1.813236e-08 2.373550e-09 2.035918e-09 3.957867e-08
1994 0.000003 4.548627e-07 8.400814e-08 1.221837e-08 3.563201e-08 4.240070e-10 7.143078e-07 1.225623e-08 1.638794e-09 2.121875e-07 ... 0.000021 3.436024e-09 0.000002 0.000001 6.640938e-07 8.800427e-08 1.838808e-08 2.405292e-09 1.947023e-09 3.881467e-08

5 rows × 15571 columns

In [27]:
Out[27]:
3.226014294971885e-06

Although data cleaning is sparse in this project, it is compensated by the large amount of data transformation in the EDA. This is because the data was scraped and concrete research into this niche area is rather lacking.

EDA

For each research questions shortlisted, outline your methodology in answering them. Discuss interesting observations or results discovered.

Please note to only show EDA that's relevant to answering the question at hand. If you have done any data modeling, include in this section.

Q1. "Crosswordese"

Firstly, let us define "short answers" as answers with at most 7 letters, and "long answers" as answers with at least 8 letters. For each crossword, we will do the following:

  1. Group all the clues from the specific day
  2. Filter the DataFrame to only have short answers
  3. For each puzzle, for each short answer that appears on the NGram table, check it against the corresponding year that it appeared, and add it to a total, call this number the "Score"
  4. For each puzzle, for each short answer, compare it against the wordlist and give it the corresponding score, then add it all up for that outlet's daily crossword

We will obtain a DataFrame with the following columns: Day, Date, Outlet, Score, AScore, BScore. This is our base data for graphing, and it is saved in "Q1 Data.csv". Using the score, we can determine how much crosswordese a puzzle generally contains: the higher the score, the better the puzzle. Then, we plot to observe any trends.

There is a limitation to this method. Some words are not found in the lists, so I am unable to score them properly; I chose to give them a score of 0 as a baseline. This problem appears more often with the NGram dataset, as it has fewer entries and is less extensive. However, it still provides a reasonably good picture, as all crosswords should be affected similarly by missing entries.
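
The scoring steps above can be sketched as follows. The lookup dictionaries and their values are hypothetical stand-ins for the NGram table and wordlist A, and only one wordlist is shown for brevity.

```python
import pandas as pd

# Hypothetical lookups; in the report these come from datasets 3 to 5.
ngram_lookup = {"ERA": {1997: 3.0e-6}, "ORE": {1997: 4.0e-7}}
wordlist_a = {"ERA": 50, "ORE": 60, "RUSH": 85}

clues = pd.DataFrame({
    "Answers": ["ERA", "ORE", "RUSH", "BLOCKBUSTER"],
    "Outlet": ["New York Times"] * 4,
    "Date": pd.to_datetime(["1997-01-27"] * 4),
})

# Step 2: keep short answers; .copy() avoids pandas' SettingWithCopyWarning.
short = clues[clues["Answers"].str.len() <= 7].copy()

# Steps 3 and 4: look up each answer, scoring unlisted answers as 0.
short["Score"] = [
    ngram_lookup.get(ans, {}).get(date.year, 0.0)
    for ans, date in zip(short["Answers"], short["Date"])
]
short["AScore"] = short["Answers"].map(wordlist_a).fillna(0)

# One row per puzzle: sum the scores for each outlet and date.
per_puzzle = short.groupby(["Outlet", "Date"], as_index=False)[["Score", "AScore"]].sum()
print(per_puzzle)
```

Taking an explicit copy of the filtered frame before adding the score columns is what keeps pandas from warning about assignments to a slice.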

In the above graph, there does not seem to be much correlation between the day of week and how much crosswordese is present when we test it with wordlist A.

[Figures: two plots, each with y-axis label 'Short score']

When using the NGram dataset, which reflects real-life usage of these words, an interesting trend emerges: there seem to be 3 distinct sections in the stripplot. A more detailed discussion will be included in the results.

[Figure: x-axis label 'Year']

We can see that there is a slight trend upwards, showing improvement in the accessibility in terms of wordlist metrics. However, such a trend cannot be observed with NGram data. The dip in 2000 can be explained by missing data.

Q2. Freshness

The procedure for this question is similar to Q1, with the wordlists again used for comparison. The wordlist scores are used as they are. For my own freshness score for long answers, I use a harmonic scale, in which the i-th occurrence of an answer scores 1/i. The score of a crossword is then the sum of the scores of its long answers. Also, since long answers are rather rare in a crossword, the number of long answers in each puzzle is recorded too.

Afterwards, these points are plotted to see whether any trends can be spotted.
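
A minimal sketch of the harmonic scoring is given below on a toy table. It assumes occurrences are ordered by publication date across all puzzles, which is one possible reading of "i-th occurrence".

```python
import pandas as pd

# Toy long answers; BLOCKBUSTER appears three times over the years.
long_answers = pd.DataFrame({
    "Answers": ["BLOCKBUSTER", "BRAINTEASER", "BLOCKBUSTER", "BLOCKBUSTER"],
    "Date": pd.to_datetime(["1997-01-27", "1998-03-02", "1999-05-10", "2001-07-04"]),
}).sort_values("Date")

# cumcount() numbers repeats of each answer 0, 1, 2, ... in date order.
occurrence = long_answers.groupby("Answers").cumcount() + 1
long_answers["Score"] = 1 / occurrence

# A puzzle's freshness is the sum of its answers' scores, alongside a count.
per_puzzle = long_answers.groupby("Date")["Score"].agg(Score="sum", Count="size")
print(per_puzzle)
```

A brand-new answer contributes a full point, while the third appearance of the same answer contributes only a third of a point.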

In [39]:
Out[39]:
Clues Answers Outlet Date Day Year Score AScore BScore
0 Big name in video rentals BLOCKBUSTER New York Times 1997-01-27 0 1997 0 0 0
1 Cyclotron ATOMSMASHER New York Times 1997-01-27 0 1997 0 0 0
2 Exhausting task BACKBREAKER New York Times 1997-01-27 0 1997 0 0 0
3 Yegg SAFECRACKER New York Times 1997-01-27 0 1997 0 0 0
4 'Little' extraterrestrials GREENMEN New York Times 1997-01-28 1 1997 0 0 0
... ... ... ... ... ... ... ... ... ...
223495 Amusing poser BRAINTEASER Wall Street Journal 2022-07-06 2 2022 0 0 0
223496 Bob Feller and Nolan Ryan, by reputation FLAMETHROWERS Wall Street Journal 2022-07-06 2 2022 0 0 0
223497 Feeding the hungry, say ACTOFMERCY Wall Street Journal 2022-07-06 2 2022 0 0 0
223498 One whose beliefs could use some rounding out? FLATEARTHER Wall Street Journal 2022-07-06 2 2022 0 0 0
223499 Station posting TRAINSCHEDULE Wall Street Journal 2022-07-06 2 2022 0 0 0

223500 rows × 9 columns

In [43]:
Out[43]:
Outlet Date Day Score AScore BScore Count
0 L.A. Times Daily 2005-07-02 5 7.250000 405 360 9
1 L.A. Times Daily 2005-07-03 6 7.000000 565 370 10
2 L.A. Times Daily 2005-07-04 0 2.767857 385 350 7
3 L.A. Times Daily 2005-07-05 1 4.833333 339 250 6
4 L.A. Times Daily 2005-07-06 2 2.750000 290 220 5
... ... ... ... ... ... ... ...
28236 Wall Street Journal 2022-06-22 2 6.000000 96 170 6
28237 Wall Street Journal 2022-06-25 5 14.697169 1000 1000 26
28238 Wall Street Journal 2022-06-27 0 3.000000 255 180 4
28239 Wall Street Journal 2022-07-02 5 12.116667 360 350 16
28240 Wall Street Journal 2022-07-06 2 4.458333 388 110 6

28241 rows × 7 columns

[Figure: Wordlist Long Score against day]

Through the week, we can see that the long score increases.

[Figure: Displot of long score against time]

The max of the long score seems to be increasing.

[Figure: Lineplot of long score of various outlet's crosswords over time]

In general, we can see that the long score has been increasing, except for the Wall Street Journal, whose long score has been decreasing.

[Figure: Boxplot of long score against the month]

There seems to be no correlation between month and the quality of crosswords. This is expected.

[Figure: Boxenplot of long score against outlet]
[Figure: Number of long answers for each outlet]

We can see that New York Times is the best for long answers, followed by LA Times and Wall Street Journal, then Universal and USA Today. In a similar fashion, New York Times has the most long answers, followed by Wall Street Journal, then LA Times, then Universal, then USA Today.

[Figure: Boxenplot of long score against year]

No significant trend over time can be seen in this boxenplot.


We can see that the long scores from the two wordlists are similar and do not differ much. It is thus reasonable to assume that changing the wordlist will not affect the results significantly, so the results should be reasonably robust to the choice of wordlist.
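
The agreement between the two wordlists could also be checked numerically. The sketch below computes a Pearson correlation on the first five rows of the long-answer table shown earlier in this section; it is an illustrative check on a tiny sample, not the comparison actually performed in the notebook.

```python
import pandas as pd

# First five rows of the long-answer table (AScore and BScore columns).
scores = pd.DataFrame({
    "AScore": [405, 565, 385, 339, 290],
    "BScore": [360, 370, 350, 250, 220],
})

# Pearson correlation between the two wordlists' long scores.
r = scores["AScore"].corr(scores["BScore"])
print(f"Pearson r between wordlist A and B long scores: {r:.2f}")
```

Running the same calculation over the full table would give a single number to quote alongside the plot.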

Q3. Inclusive Clues

In this section, I analyse how inclusive clues have been over the years. I first concatenate all the words in each crossword, then search for each word in the namelists. Each match scores one point, and the totals are then plotted to look for trends. The two namelists used are from baby-name websites. Of course, this method is limited by the namelists; however, with 1000 names for each gender, it should be fairly robust.
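
The matching step can be sketched as below. The tiny name sets and the `count_names` helper are hypothetical, and the tokenisation (splitting on non-letters) is a simplification that may differ from the notebook's.

```python
import re

# Tiny stand-ins for the 1000-name lists (datasets 6 and 7).
boy_names = {"liam", "noah", "drake"}
girl_names = {"emma", "olivia", "ava"}

def count_names(texts):
    """Return (boy_count, girl_count) over all words in the clues and answers."""
    boys = girls = 0
    for text in texts:
        for word in re.findall(r"[A-Za-z]+", text):
            word = word.lower()
            boys += word in boy_names    # True counts as 1
            girls += word in girl_names
    return boys, girls

# Clues and answers from one toy puzzle, concatenated into a single list.
puzzle = ["'Toosie Slide' rapper", "DRAKE", "Actress Gardner", "AVA"]
print(count_names(puzzle))
```

Storing the names in sets keeps each membership test constant-time, which matters when every word of every puzzle is checked. Note that a plain word match cannot tell a name from an ordinary word spelled the same way.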

In [58]:
Out[58]:
Outlet Date Words
0 L.A. Times Daily 2005-07-02 [A, Natural, Man, singer, RAWLS, Card, Players...
1 L.A. Times Daily 2005-07-03 [, from, New, York, show,, briefly, SNL, Got, ...
2 L.A. Times Daily 2005-07-04 [Fine, studies, ARTS, Not, guilty,, eg, PLEA, ...
3 L.A. Times Daily 2005-07-05 [What, a, relief, WHEW, Is, Born, ASTAR, Actor...
4 L.A. Times Daily 2005-07-06 [Dont, bother, SKIPIT, Even, speak, ASWE, Gran...
... ... ... ...
28289 Wall Street Journal 2022-06-22 [Toosie, Slide, rapper, DRAKE, AWOL, chasers, ...
28290 Wall Street Journal 2022-06-25 [Dude, BRO, Force, Behind, the, Forces, grp, U...
28291 Wall Street Journal 2022-06-27 [Winnie, Pu, first, Latin, bestseller, in, the...
28292 Wall Street Journal 2022-07-02 [2001, computer, HAL, Bravo, OLE, Can, I, get,...
28293 Wall Street Journal 2022-07-06 [Central, Park, in, the, Dark, composer, IVES,...

28294 rows × 3 columns

[Figure: Occurrences of names in various outlet's crosswords over the years]

We can see that, in general, the number of names being used is decreasing. This may also be an effect of the push for accessibility, making puzzles more about words than about obscure celebrities. The Wall Street Journal was removed from this comparison because its values were high enough to make the graph unreadable on a common scale.


The Wall Street Journal has also been steadily decreasing the number of names in its crosswords. In general, we find that outlets have been including more male names than female names. However, this is not true for USA Today: surprisingly, female names now appear more often than male names there. This interesting observation will be discussed in greater detail in the results section.

[Figure: Number of names in crosswords per outlet]

Most outlets use few names in their crosswords, except for the Wall Street Journal, which uses them more often than the rest.

In [67]:
Out[67]:
Outlet Date Words BScore GScore Year Total
0 L.A. Times Daily 2005-07-02 [A, Natural, Man, singer, RAWLS, Card, Players... 4 2 2005 6
1 L.A. Times Daily 2005-07-03 [, from, New, York, show,, briefly, SNL, Got, ... 12 7 2005 19
2 L.A. Times Daily 2005-07-04 [Fine, studies, ARTS, Not, guilty,, eg, PLEA, ... 3 4 2005 7
3 L.A. Times Daily 2005-07-05 [What, a, relief, WHEW, Is, Born, ASTAR, Actor... 4 3 2005 7
4 L.A. Times Daily 2005-07-06 [Dont, bother, SKIPIT, Even, speak, ASWE, Gran... 5 4 2005 9
... ... ... ... ... ... ... ...
28289 Wall Street Journal 2022-06-22 [Toosie, Slide, rapper, DRAKE, AWOL, chasers, ... 2 4 2022 6
28290 Wall Street Journal 2022-06-25 [Dude, BRO, Force, Behind, the, Forces, grp, U... 6 5 2022 11
28291 Wall Street Journal 2022-06-27 [Winnie, Pu, first, Latin, bestseller, in, the... 3 1 2022 4
28292 Wall Street Journal 2022-07-02 [2001, computer, HAL, Bravo, OLE, Can, I, get,... 14 5 2022 19
28293 Wall Street Journal 2022-07-06 [Central, Park, in, the, Dark, composer, IVES,... 5 3 2022 8

28294 rows × 7 columns

Q4. Constructors

Now, we combine the genders of the constructors with the NYT crosswords and do some analysis. This is putting all the results together, making use of everything before. We will use the results from Q1,Q2 and Q3 to assist us in our exploration. First, let us combine all the results into one dataframe.


Also, I want to plot some trends involving Q1, Q2 and Q3, to assist in this question. Unfortunately, no visible trends can be spotted here.

[Figure: seaborn pair plot]

For this question, we only have data from the NYT, hence we need to slice the dataframe and merge it with the constructor info.
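
The slice-and-merge step might look like the following sketch. The column values are illustrative, and joining on `Date` alone assumes one NYT puzzle per date in both tables.

```python
import pandas as pd

# Toy per-puzzle scores for several outlets (values are illustrative).
scores = pd.DataFrame({
    "Outlet": ["New York Times", "New York Times", "USA Today"],
    "Date": pd.to_datetime(["1997-01-27", "1997-01-28", "1997-01-27"]),
    "AShort": [61.2, 58.9, 63.0],
})
# Toy constructor info in the shape of the cleaned XWordInfo data.
constructors = pd.DataFrame({
    "Date": pd.to_datetime(["1997-01-27", "1997-01-28"]),
    "C1 Gender": ["M", "F"],
})

# Slice to the NYT rows, then join on the publication date.
nyt = scores[scores["Outlet"] == "New York Times"]
merged = nyt.merge(constructors, on="Date", how="inner", validate="one_to_one")
print(merged)
```

Passing `validate="one_to_one"` makes pandas raise an error if either table has duplicate dates, which guards against silently duplicated rows.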

In [72]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8210 entries, 0 to 8209
Data columns (total 25 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Day        8210 non-null   object        
 1   Date       8210 non-null   datetime64[ns]
 2   C1 No.     8210 non-null   int32         
 3   C1 Gender  8210 non-null   object        
 4   C2 No.     8210 non-null   object        
 5   C2 Gender  8210 non-null   object        
 6   C3 No.     8210 non-null   object        
 7   C3 Gender  8210 non-null   object        
 8   NShortSum  8210 non-null   float64       
 9   AShortSum  8210 non-null   int64         
 10  BShortSum  8210 non-null   int64         
 11  AShort     8210 non-null   float64       
 12  BShort     8210 non-null   float64       
 13  NShort     8210 non-null   float64       
 14  NLong      8210 non-null   float64       
 15  ALong      8210 non-null   float64       
 16  BLong      8210 non-null   float64       
 17  Count      8210 non-null   float64       
 18  Month      8210 non-null   int32         
 19  Year2      8210 non-null   float64       
 20  Words      8210 non-null   object        
 21  BNames     8210 non-null   int64         
 22  GNames     8210 non-null   int64         
 23  Year       8210 non-null   int32         
 24  Total      8210 non-null   int64         
dtypes: datetime64[ns](1), float64(9), int32(3), int64(5), object(7)
memory usage: 1.5+ MB

Gender

[Figure: Number of crosswords constructed each year for the NYT, split by gender]

We can see that the crossword scene in the New York Times is primarily male-dominated.

[Figure: Amount of constructors for each day of the week, split by gender]

As the week goes on, more and more male constructors appear and, unfortunately, the number of female constructors decreases. The exception is Sunday, which has a similar difficulty to Wednesday and Thursday puzzles. One may infer that the tougher puzzles later in the week discourage more female constructors from constructing.

In [75]:
Out[75]:
Text(0, 0.5, 'NGram short score')
In [76]:
Out[76]:
Text(0.5, 1.0, 'Self-long score against year, split by gender')
In [77]:
Out[77]:
Text(0.5, 1.0, 'Wordlist short score against year, split by gender')
In [78]:
Out[78]:
Text(0.5, 1.0, 'Wordlist long score against year, split by gender')

We can see that crosswordese generally remains similar, but freshness is higher among men.

In [79]:
Out[79]:
Text(0.5, 1.0, 'Boxplot of wordlist short score, split by gender')

Generally, using wordlists, there seems to be no difference in the amount of crosswordese between men and women.

In [80]:
Out[80]:
Text(0.5, 1.0, 'Boxplot of wordlist long score, split by gender')
In [81]:
Out[81]:
Text(0.5, 1.0, 'Boxplot of own metric long score, split by gender')

Using wordlists and NGrams, men have a higher long answer score than women.

In [82]:
In [83]:

Men and women use a similar number of boys' names. However, women are generally more inclined to use girls' names.

Experience

In [84]:
Out[84]:
Text(0, 0.5, 'Short Score')
In [85]:

From the two graphs, we can see that as a constructor makes more puzzles, their "worst" puzzle scores increase, meaning that they become more consistent in making puzzles.
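The "worst score per experience level" idea can be sketched as below. This is an illustration under assumptions, not the notebook's code: `C1 No.` is taken to be the constructor's puzzle count so far, the bins of width 10 mirror the `C1 Bin` values visible in the collaboration table, and `BShort` is assumed to hold a per-puzzle short score. The toy values are invented.

```python
import pandas as pd

# Invented rows; column names follow the dataframe summary.
toy = pd.DataFrame({
    "C1 No.": [1, 3, 8, 12, 15, 19, 24, 27],
    "BShort": [40.0, 52.0, 47.0, 50.0, 55.0, 49.0, 53.0, 56.0],
})

# Bin constructors by experience: 0-9 -> 0, 10-19 -> 10, 20-29 -> 20, ...
toy["C1 Bin"] = toy["C1 No."] // 10 * 10

# The minimum score in each bin is the "worst" puzzle at that experience level.
worst = toy.groupby("C1 Bin")["BShort"].min()
print(worst)
```

If the worst score rises from bin to bin, the floor of puzzle quality is rising with experience, which is what "more consistent" means here.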

In [86]:
Out[86]:
Text(0.5, 1.0, 'Newer constructors have quite similar short scores to seasoned ones')
In [87]:
Out[87]:
Text(0.5, 1.0, 'Newer constructors have higher long scores to seasoned ones')
In [88]:
Out[88]:
Text(0.5, 0, 'Number of previous puzzles constructed')

As someone constructs more puzzles, the number of names they use decreases. This suggests that they want to make their puzzles more accessible. Additionally, the numbers of male and female names they use become more and more similar, suggesting that they may be striving to be more inclusive.

In [89]:
Out[89]:
Text(0.5, 1.0, 'Constructor number against year')
<Figure size 864x360 with 0 Axes>

This graph just shows that constructor number increases with time, which is an expected trend. It also shows how constructors keep returning to the New York Times. However, the majority have only one puzzle in the New York Times, as we can see in the more darkly colored section closer to 0.

In [90]:
Out[90]:
Text(0.5, 1.0, 'Constructor number against the day of week they get published')
<Figure size 1152x360 with 0 Axes>
In [91]:
Out[91]:
Text(0.5, 1.0, 'Constructor number against the day of week they get published')
<Figure size 1152x360 with 0 Axes>

Unfortunately, it seems that seaborn has a bug which prevents me from sorting the rows properly. In general, we find that as the week goes from Monday to Saturday, the number of new constructors decreases. The anomaly here is the Sunday puzzle.
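One possible workaround for the ordering problem, offered as a sketch rather than a confirmed fix for this notebook, is to store `Day` as an ordered categorical so that sorting and grouping follow the calendar instead of the alphabet. The toy rows are invented.

```python
import pandas as pd

DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Invented rows; "Day" matches the column name in the dataframe summary.
toy = pd.DataFrame({"Day": ["Sunday", "Monday", "Friday", "Tuesday", "Monday"]})

# An ordered categorical makes pandas sort and group in calendar order.
toy["Day"] = pd.Categorical(toy["Day"], categories=DAY_ORDER, ordered=True)

sorted_days = toy.sort_values("Day")["Day"].tolist()
counts = toy.groupby("Day", observed=False).size()  # keeps days with zero rows
print(counts)
```

Seaborn's faceting functions such as `catplot` and `FacetGrid` also accept a `row_order` argument, which may be worth trying depending on which plotting function was used.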

Collaborations

In [92]:
C:\Users\USER\AppData\Local\Temp\ipykernel_13040\457256435.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  collab["C2 No."]=collab["C2 No."].astype(int).copy()
Out[92]:
Day Date C1 No. C1 Gender C2 No. C2 Gender C3 No. C3 Gender NShortSum AShortSum ... BLong Count Month Year2 Words BNames GNames Year Total C1 Bin
59 Saturday 1997-03-29 15 F 126 M - - 6.491294 3543 ... 470.0 12.0 3 1995.0 ['A', 'kingdom', 'for', 'Henry', 'V', 'ASTAGE'... 6 2 1997 8 10
190 Friday 1997-08-08 20 M 28 M - - 6.066707 3418 ... 450.0 12.0 8 1995.0 ['Dagnabbit', 'NERTS', 'Enough', 'STOPIT', 'Ki... 3 3 1997 6 20
296 Sunday 1997-11-23 8 M 1 M - - 19.031943 7435 ... 560.0 14.0 11 1995.0 ['two', 'mints', 'INONE', 'Dallas', 'Miss', 'E... 16 13 1997 29 0
308 Friday 1997-12-05 1 M 4 M - - 229.045053 3345 ... 500.0 14.0 12 1995.0 ['Primary', 'Colors', 'author,', 'for', 'short... 5 3 1997 8 0
312 Tuesday 1997-12-09 16 F 127 M - - 6.758687 4398 ... 100.0 5.0 12 1995.0 ['Casablanca', 'role', 'RICK', 'OK,', 'why', '... 3 1 1997 4 10
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8196 Monday 2022-06-13 3 M 2 F - - 22.822379 4761 ... 90.0 2.0 6 2020.0 ['', 'spoon', 'fork?', 'ORA', 'Ah,', 'that', '... 3 7 2022 10 0
8199 Thursday 2022-06-16 1 M 51 M - - 47.613915 4342 ... 230.0 9.0 6 2020.0 ['', 'but', 'it', 'seems', 'like', 'you', 'hat... 3 2 2022 5 0
8204 Friday 2022-06-24 8 F 4 F - - 7.066982 3573 ... 510.0 11.0 6 2020.0 ['Stronger', 'than', 'pain', 'sloganeer', 'ADV... 8 5 2022 13 0
8206 Sunday 2022-06-26 14 M 21 M - - 27.225994 7711 ... 420.0 16.0 6 2020.0 ['Despicable', 'Me', 'antihero', 'GRU', 'Hairs... 9 6 2022 15 10
8209 Thursday 2022-06-30 35 M 54 M - - 35.286622 4499 ... 0.0 1.0 6 2020.0 [',', 'in', 'emails', 'URGENT', 'but', 'perhap... 8 5 2022 13 30

680 rows × 26 columns
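pandas emits a `SettingWithCopyWarning` when a column is assigned on a dataframe that was itself obtained by slicing another one. A common way to avoid it is to take an explicit copy when filtering. The sketch below assumes that `"-"` in `C2 No.` marks a puzzle without a second constructor, as the `-` entries in the table suggest; the toy rows are invented.

```python
import pandas as pd

# Invented rows; column names follow the dataframe summary.
toy = pd.DataFrame({
    "C1 No.": [15, 20, 8, 3],
    "C2 No.": ["126", "-", "1", "-"],
})

# Assumption: "-" means "no second constructor".
# .copy() makes `collab` an independent dataframe rather than a slice of `toy`.
collab = toy[toy["C2 No."] != "-"].copy()

# Assigning on the explicit copy no longer triggers the warning.
collab["C2 No."] = collab["C2 No."].astype(int)
print(collab)
```

The original dataframe is left untouched, so later cells that rely on the `"-"` placeholder still work.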

In [93]:
Out[93]:
Text(0.5, 1.0, 'Scatterplot of the relationship of the puzzle number of collaborators')

There seems to be no correlation between who collaborates with whom. However, we can see a clear clustering of points along the x- and y-axes. This suggests that collaborations are mostly used to induct new constructors into the New York Times crossword.

In [94]:
Out[94]:
Text(0, 0.5, 'Number of collaborations')

While the number of daily crosswords has remained similar throughout the years, the number of collaborations has increased. This suggests more openness within the community to inducting new people, and a greater sense of community.
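Since the yearly puzzle total is roughly constant, the trend can be expressed either as a count or as a share of puzzles. A minimal sketch follows, again assuming that `"-"` in `C2 No.` marks a solo puzzle and using invented toy rows.

```python
import pandas as pd

# Invented rows; column names follow the dataframe summary.
toy = pd.DataFrame({
    "Year": [1997, 1997, 1997, 1997, 2022, 2022, 2022, 2022],
    "C2 No.": ["-", "-", "-", "126", "2", "-", "51", "4"],
})

# Assumption: a puzzle is a collaboration when a second constructor is listed.
toy["is_collab"] = toy["C2 No."] != "-"

# sum = number of collaborations, count = puzzles that year, mean = share.
per_year = toy.groupby("Year")["is_collab"].agg(["sum", "count", "mean"])
print(per_year)
```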

In [95]:
Out[95]:
Text(0.5, 1.0, 'Number of puzzles published by different pairs of constructors over the years')
In [96]:
C:\Users\USER\AppData\Local\Temp\ipykernel_13040\1712906548.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  collab["C1C2 Gender"]=collab["C1 Gender"].copy()+collab["C2 Gender"].copy()
Out[96]:
Text(0.5, 1.0, 'Number of puzzles published by different pairs of constructors over the years')

No clear trends can be seen between the genders of collaborators.

In [97]:
Out[97]:
Text(0.5, 1.0, 'Short score of crosswords over time, split by collaborations')
In [98]:
Out[98]:
Text(0.5, 1.0, 'Long score of crosswords over time, split by collaborations')
In [99]:
Out[99]:
Text(0.5, 1.0, 'Short score of crosswords over time by newer constructors, split by collaborations')
In [100]:
Out[100]:
Text(0.5, 1.0, 'Long score of crosswords by newer constructors over time, split by collaborations')

Collaborations seem to have similar amounts of crosswordese while increasing the freshness of puzzles, which is an overall gain. This is seen even more clearly with newer constructors, which is a benefit, as it makes it easier for them to be accepted by the New York Times.

Results Findings & Conclusion

For each research question, summarize in 2-3 visualizations which will answer the question. Interpret the results accordingly and give your observation and conclusion. The visualizations should be well presented (apply what you have learnt in Chapter 9 on data communication). The plots shown here could be an enhanced version of the EDA plots, or presented in another format.

For this section, Day 0 refers to Monday, going through the week until Day 6, Sunday.

Q1: Crosswordese

For this section, I define short score as how "good" entries with 7 letters or less are, using a metric of either the scoring given by wordlist makers, or their frequency on Google NGram. For the score, the higher the better. Remember that crosswordese is defined as the amount of obscure fill. The higher the score, the lower the crosswordese.
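To make the definition concrete, here is a minimal sketch of a wordlist-based short score. It is an assumption about how such a score could be computed, not the notebook's exact formula: the wordlist is taken to be already loaded into a dictionary from entry to score, entries missing from the wordlist are skipped, and the scores shown are invented.

```python
def short_score(entries, wordlist, max_len=7):
    """Mean wordlist score of the entries with at most `max_len` letters.

    Entries not found in the wordlist are ignored (an assumption made for
    this sketch). Returns NaN when no short entry can be scored.
    """
    scores = [wordlist[e] for e in entries
              if len(e) <= max_len and e in wordlist]
    return sum(scores) / len(scores) if scores else float("nan")


# Hypothetical scores; real wordlists assign their own values.
wordlist = {"ORA": 20, "STOPIT": 50, "NERTS": 30, "GRU": 40}

# URGENTMATTER is too long to count; XYZZY is short but not in the wordlist.
puzzle = ["ORA", "STOPIT", "NERTS", "URGENTMATTER", "XYZZY"]
print(short_score(puzzle, wordlist))
```

An NGram-based variant would swap the dictionary of wordlist scores for a dictionary of word frequencies, with the same higher-is-better reading.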

In [101]:

From the wordlist scoring, we can see that the average score of crosswords decreases throughout the week, albeit minimally. This may signal the puzzle being more and more inaccessible to puzzlers, a phenomenon that is caused by editors wishing to cater to the crossword buffs. The wordlist boxplot would show them trying to cater to both newbies and seasoned puzzlers, by making early week puzzles very accessible, and later week ones harder. This is a reasonable compromise by them.

From the NGram score, we can clearly observe three distinct sections in the data. A possible explanation is that the NGram dataset is sparser, but I suspect that is not the case. It may be that crosswords are simply inaccessible to people outside the community, which would explain why scores are higher when the person grading the entries is a crossword maker who understands the constraints. Hence, the wordlist scores are clustered closely together, since their scoring is rather similar, whilst the NGram scores are more spread out, since they represent real-world usage, which is very different from crossword usage.

In [102]:
Out[102]:
Text(0.5, 0.98, 'Short scores of crosswords over the years, split by outlet')

From the wordlist graph, we can infer that, by wordlist standards, the scores of all the outlets are increasing. This is some evidence that the amount of crosswordese has decreased over the years, which would mean greater accessibility for the average solver, who knows at least some crosswordese.

The NGram graph is more erratic, with only USA Today seeing a significant increase. This suggests that a total novice, a person who has never seen a puzzle before, could have a much harder time. A moderate amount of knowledge is needed to break into solving crosswords, but that skill floor has decreased over the years.

The interesting outlier here is USA Today, and it is an outlier by deliberate choice. USA Today touts its crossword as one of the easier ones, a beginner-friendly puzzle. The data seems to bear this out, with a great improvement in accessibility in recent years. This big jump does not imply superiority over other outlets.

Instead, it showcases a compromise that they have made. Traditionally, crosswords are rotationally or reflectionally symmetrical. However, in recent years, they have given up on symmetry, in favour of better fill and less crosswordese. It can be seen that this method works as a compromise, being less elegant, but being able to induct more solvers into the crossword universe, being an overall positive gain.

Q2: Freshness

For this section, I define long score as a metric for freshness: the higher the long score, the fresher the puzzle. Higher freshness is better.

In [103]:
<Figure size 864x360 with 0 Axes>

As the week goes on, the long score gets higher and higher. Obviously, Sunday puzzles will have a higher long score, since they have more grid space. However, for the other days of the week, a different explanation is required. A possible reason is the increasing difficulty of crosswords through the week, especially on Friday and Saturday, which are themeless puzzles for some outlets. Unrestricted by any theme, these puzzles need all their long answers to shine, and they contain more of them, since constructors have more freedom in designing the grid. This can be seen using both our own metric and the wordlist, confirming the results. What this means is that, even though later-week puzzles are less accessible to newbie solvers, veterans can be satisfied knowing that a puzzle with many snazzy answers is waiting for them.

In [104]:
Out[104]:
Text(0.5, 1.0, "Lineplot of long score of various outlet's crosswords over time")

Based on this line chart alone, we can tell that the New York Times is the best place for fresh fill. This may explain why it is said to be "the gold standard". For some background, it has the highest rates in the industry, paying about USD 500 per puzzle. This potential payout draws many constructors, so the NYT gets more submissions. It then has the luxury of selecting only the best, which gives it a competitive edge over the other outlets. While most other outlets have not seen their freshness change, the New York Times certainly has, and it sits above the rest here.

The increasing trend is showing that constructors are constantly raising the bar on what they can do and how fresh the puzzles are. We notice that with the rise of computer construction software, it has never been easier to construct crosswords. Trial and error is no longer required, and now constructors can focus solely on making their crossword the best it can be. In my opinion, this graph does reflect such a shift.

The Wall Street Journal's decline may be explained by the fact that they have fewer Sunday crosswords now, which does affect their long score. Nowadays, however, it matches most other outlets.

Q3: Inclusivity

In [105]:
Out[105]:
Text(0.5, 0.98, 'occurrences of Gendered names in crosswords over the years')

This graph shows the 5 major outlets and how their occurrences of gendered names have changed over the years. The New York Times, L.A. Times Daily and Wall Street Journal still use more male names than female names in their crosswords, whilst Universal and USA Today seem to be closing the gap. This change is probably deliberate. Under editors who aim to be more inclusive, these crosswords want to reflect society more fully and to get more women to do the puzzle.

Even though it is less obvious, the New York Times also seems to be trying to be more inclusive, with a dip in the number of male names used. This may be due to some other factors, which will be discussed in question 4.

Overall, the numbers of gendered names seem to be converging, with more female ones and fewer male ones. This should be celebrated, as it reflects a change in perception of the crossword. Since the crossword somewhat reflects who made the puzzle, the greater similarity in name counts shows that the crossword is getting more diverse. No longer is the crossword only for men; more people can now see themselves in it. It does influence how someone feels when they see something they identify with rather than a baseball team.
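For readers wondering how counts such as `BNames` and `GNames` could be derived, the sketch below matches tokens from a puzzle's `Words` list against two name sets. This is an assumed approach, not necessarily the one used in the notebook, and the name sets here are tiny hypothetical stand-ins for the top-1000 lists in the dataset references.

```python
def count_gendered_names(tokens, boy_names, girl_names):
    """Count tokens that appear in each name set, ignoring case."""
    boys = sum(1 for t in tokens if t.lower() in boy_names)
    girls = sum(1 for t in tokens if t.lower() in girl_names)
    return boys, girls


# Hypothetical name sets for illustration only.
boy_names = {"henry", "rick"}
girl_names = {"emma", "alice"}

tokens = ["A", "kingdom", "for", "Henry", "V", "ASTAGE",
          "Casablanca", "role", "RICK", "Emma"]
print(count_gendered_names(tokens, boy_names, girl_names))
```

Plain matching like this over-counts names that double as ordinary words (for example "Will" or "Grace"), so the resulting numbers are best read as estimates.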

In [106]:

It is also interesting to note that most outlets use about the same number of names. The clear outlier is the Wall Street Journal. With more names being used, it may be harder for someone who does not recognise them to solve the crossword, which goes back to the issue of crosswordese. Since the median number of names used hovers around 8-9, and is even higher for the Wall Street Journal, it is all the more pertinent that crosswords contain a diverse set of names, so that no one group of people feels left out. Names are not just another clue; they have the power to make us connect and feel things.

On a side note, it may be that the Wall Street Journal dataset is skewed by the fact that they have more Sunday crosswords, which are bigger, and hence they may contain more names. If this is the case, it probably would not deviate from the general trend of other crosswords too much.

Q4: Constructors

Gender

In [107]:
In [108]:

This first graph shows that the NYT is still very much male-dominated in terms of constructors, even though the number of female constructors since 2020 has been higher than in previous years. Most crosswords are still made by male constructors.

The problem is exacerbated when the trend is split by day of the week. First, note that the x-axis is ordered by day of the week, except for Sunday. Since the difficulty of the puzzle increases through the week, this is a possible reason why fewer women construct for the later days. As discussed previously, the later days of the week are for crossword buffs, which is a rather gated community, so women may be less inclined to construct for those days. Earlier days in the week are more accessible to them, which may be why they pick those instead.

This is not to discount them as constructors, however. It may very well be that external factors, even personal preference, explain why they construct those puzzles less often. However, it is still a problem, as there is a potential male bias in the fill towards later days, making an already hard puzzle even more inaccessible for a certain gender.

In [109]:
Out[109]:
(3.0, 8.0)

This graph illustrates the importance of diversity in construction. Firstly, male names are used more than female names. Since USA Today has shown that more female names can be used, this is not exactly ideal. However, it still shows an important aspect of construction. Female constructors tend to use more female names than male constructors do! Having grown up female, their life experiences, idols and likes would be different. They bring a piece of their personality into the puzzle, incorporating what men may not regard as common knowledge. While male constructors have been trying to decrease the number of names they use, female constructors have been capitalising on names, trying to integrate more of their character.

Experience

In [110]:

We can see that as the week goes on, fewer and fewer newer constructors make the puzzle. This may be caused by the relative difficulty of constructing such puzzles, which is known to increase through the week. The anomaly of Sunday can be explained by its difficulty being more similar to a Wednesday/Thursday puzzle. As the solving difficulty increases, so does the difficulty of making the puzzle. Since newer constructors have a higher chance of being rejected, making an early-week puzzle is safer for them. Jeff Chen of XWordInfo, a crossword site, recommends that newbies not dive straight into constructing late-week puzzles. As discussed before, this low density of newer constructors may be caused by the NYT being the gold standard, making it very difficult to meet its high standards immediately.

In [111]:
<Figure size 864x360 with 0 Axes>

Newer constructors do have a slight disadvantage when it comes to freshness, which is to be expected as they are newer to it. This helps explain why it is hard for newer constructors to get accepted: they are facing stiff competition. That is not to say that the NYT does not accept them; it still tries to aid newer constructors.

Collaborations

In [112]:

As the years have gone by, the number of collaborations has increased. This may be caused by better communication and the community generally interacting more. More collaborations can benefit everyone. Constructors bring a wider range of experiences together and can produce a more diverse puzzle, a puzzle for one and all. Solvers get to see a more robust puzzle, elevating their solving experience.

In [113]:

Overall, for both seasoned and new constructors, collaborations do not change the amount of crosswordese. This may just be because crosswordese is a necessary evil for "good" crossword puzzles, and hence is unavoidable. What the constructor can control is the long score, the freshness factor. For both types of constructors, the freshness of their fill increases slightly when they collaborate with somebody. Although the increase is small, we need to look at how these answers are scored. Since we are using a wordlist, an increase of just 10 signifies a "better" answer, a more trendy word. Hence, any sort of improvement helps.

Recommendations or Further Works

State any recommendations, improvements or further works.

Recommendations:

For crosswordese, not much can be done; it seems pretty much set in stone. For freshness, constructors can and should try to do better, with the aid of computer programs. From the current trend, more inclusivity is needed. Various programs provide valuable mentorship to constructors and assist them in debuting a crossword. The most help is required for late-week crosswords. More needs to be done so that a better picture of who is in our crosswords is reflected. Some outlets have been seeing a shift. For example, the L.A. Times Daily has been publishing more puzzles by female constructors, trying to keep their share above 50%. These sorts of actions can hopefully induct more marginalised groups into crosswords, making it a better community. Of course, some outlets, such as the New York Times, simply cannot do that. However, they should try their best to assist marginalised constructors whose crosswords are not quite up to par, to give them a chance. Even though this idea is somewhat unfair, it also serves as a form of affirmative action.

Limitations:

This project has largely relied on my own transformed data. Even though I have made my best efforts to consider why and how my transformation of data is justified, it is still not perfect. Much of the data used in this project was transformed from original data sources, so it may be somewhat inaccurate. Having said that, some of the findings in this project largely match what has already been found elsewhere, hence they can generally be considered reliable. One example of this limitation is that long-word scoring using my own scale was rather unreliable. Additionally, the only dataset I could find for Q4 was from the New York Times, which may have biased the results, so the findings may not apply to other outlets.

Future works

Further exploration can be done on Q4 for the other outlets, as I am missing that data from the other outlets. A more concrete and objective metric can be used, as my methods are not exactly perfect. As more and more crossword types and formats appear, one can try to do a similar project on other variations of the crossword. A famous and interesting example to try this on would be the cryptic crossword, mostly played in the UK.

Appendix

An appendix is included along with this project, which contains all the web scraping code.

References

Cite any references made, and links where you obtained the data.
  1. https://wordpress.com/support/markdown-quick-reference/ (you may refer to this link on markup for Jupyter when formatting your proposal)
  2. https://pudding.cool/2020/11/crossword/ (article that has tried looking at representation before)
  3. https://noahveltman.com/crossword/ (project that has tried looking at crosswordese before)
  4. https://www.nytimes.com/interactive/2016/02/07/opinion/what-74-years-of-times-crosswords-say-about-the-words-we-use.html (article by the New York Times looking at the evolution of words)

Dataset references

  1. https://www.crosswordgiant.com/browse (website to scrape for the clue answer pairs)
  2. https://www.xwordinfo.com/ (contains more data about the NYT Crossword in particular)
  3. https://books.google.com/ngrams/ (for word searching)
  4. https://peterbroda.me/crosswords/wordlist/lists/peter-broda-wordlist__scored.txt (crossword construction wordlist by Peter Broda)
  5. https://drive.google.com/uc?export=download&id=1Ruxn8XzRNstU6sDPOMm_K72fVookrPPr (crossword construction wordlist by Brooke Husic and Enrique Henestroza Anguiano)
  6. https://www.verywellfamily.com/top-1000-baby-boy-names-2757618 (boy name list)
  7. https://www.verywellfamily.com/top-1000-baby-girl-names-2757832 (girl name list)